# 🧩 Agent-ScanKit

## Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations

Agent-ScanKit is a systematic probing toolkit designed to evaluate and disentangle **memory-driven vs. reasoning-driven behaviors** in multimodal GUI agents.

It provides a unified pipeline for:

1. Preparing foundation models
2. Preprocessing benchmark datasets
3. Running baseline evaluations across 5 datasets
4. Conducting fine-grained probing via visual, textual, and structural perturbations

### 📦 Installation

```
git clone https://github.com/xxx/Agent-ScanKit.git
cd Agent-ScanKit
pip install -r requirements.txt
```

### ⚙️ Model Preparation

Before running evaluations, download or link the models to be tested.
We support a range of GUI agents (OS-Atlas, Aguvis, UI-TARS, GUI-R1, GUI-Owl, AgentCPM, etc.).
You may need to adjust configuration paths in `bash/evaluation.sh`.

### 📂 Dataset Preprocessing

Agent-ScanKit supports five benchmark datasets:

* AndroidControl
* AITZ
* GUI-Odyssey
* GUI-Act (Mobile/Web)
* OmniAct (Web/Desktop)

Preprocess each dataset into a unified format tailored to each agent:

```
bash bash/preprocess_ac.sh --dataset ac
bash bash/preprocess_[probing].sh --method visual
```

### 📊 Comprehensive Evaluation

To measure baseline performance, run the base evaluation on each of the five datasets.
This reports task success rate and step-wise accuracy before any perturbation is applied.

```
bash bash/evaluation.sh --model <model_name> --dataset <dataset_name>
```

Example:

```
bash bash/evaluation.sh --model UI-TARS-7B-SFT --dataset androidcontrol
```

Results are logged to `results/datasetName/`.
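For orientation, here is a minimal sketch of how the two baseline metrics could be computed from per-step correctness flags. It is not the toolkit's own scoring code: the function names, the input layout (episode ID mapped to a list of booleans), and the rule that a task succeeds only when every step is correct are all assumptions made for illustration.

```python
from typing import Dict, List


def step_accuracy(episodes: Dict[str, List[bool]]) -> float:
    """Fraction of individual steps predicted correctly, pooled over all episodes."""
    steps = [ok for results in episodes.values() for ok in results]
    return sum(steps) / len(steps) if steps else 0.0


def task_success_rate(episodes: Dict[str, List[bool]]) -> float:
    """Fraction of episodes in which every step was correct (assumed definition)."""
    if not episodes:
        return 0.0
    solved = sum(1 for results in episodes.values() if results and all(results))
    return solved / len(episodes)


# Hypothetical per-step correctness flags for three episodes.
demo = {
    "episode_0": [True, True, True],
    "episode_1": [True, False, True],
    "episode_2": [False, True],
}
print(f"step accuracy:     {step_accuracy(demo):.3f}")
print(f"task success rate: {task_success_rate(demo):.3f}")
```

Comparing these two numbers before and after a perturbation is what makes the probing results interpretable, so it helps to fix the definitions once and reuse them for every run.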

### 🔍 Probing Experiments

After establishing baselines, run the sensitivity perturbations:

* Visual Probing: Mask or occlude target regions.

```
bash bash/visual_guided_probing.sh --model <model_name> --dataset <dataset_name>
```
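To make "mask or occlude target regions" concrete, the sketch below blanks out a bounding box in a tiny grayscale image stored as nested lists. The `mask_region` helper, the box convention (left, top, right, bottom with exclusive right and bottom edges), and the fill value are illustrative assumptions, not the script's actual implementation.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale pixels, row-major
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def mask_region(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of `image` with the pixels inside `box` set to `fill`."""
    left, top, right, bottom = box
    height = len(image)
    width = len(image[0]) if image else 0
    # Clamp the box to the image bounds so out-of-range boxes are safe.
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    masked = [row[:] for row in image]  # copy rows so the input is untouched
    for y in range(top, bottom):
        for x in range(left, right):
            masked[y][x] = fill
    return masked


screenshot = [[255] * 6 for _ in range(4)]  # 6x4 all-white stand-in for a screenshot
target_box = (1, 1, 4, 3)                   # hypothetical target element
probed = mask_region(screenshot, target_box)
for row in probed:
    print(row)
```

If an agent still predicts the original click location after its target has been occluded, that suggests the prediction relies on something other than the visible element.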

* Textual Probing: Modify instructions or perturb vocabulary space.

```
bash bash/textual_guided_probing.sh --model <model_name> --dataset <dataset_name>
```
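One simple way to modify an instruction is whole-word substitution from a lookup table, sketched below. The `perturb_instruction` helper and the `SYNONYMS` table are hypothetical and only illustrate the idea; the toolkit's own textual perturbations may work differently.

```python
import re
from typing import Dict


def perturb_instruction(instruction: str, substitutions: Dict[str, str]) -> str:
    """Replace whole words in `instruction` using a case-insensitive lookup table."""

    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        # Words without an entry are returned unchanged.
        return substitutions.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, instruction)


# Hypothetical substitution table; keys must be lowercase.
SYNONYMS = {"open": "launch", "tap": "press", "settings": "preferences"}

original = "Open the Settings app and tap Wi-Fi"
print(perturb_instruction(original, SYNONYMS))
```

Note that this sketch does not preserve the capitalization of replaced words, which is acceptable for a probe but worth handling if the perturbed text must stay natural.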

* Structural Probing: Alter state/action sequences (e.g., back/complete/wait).

```
bash bash/structure_guided_probing.sh --model <model_name> --dataset <dataset_name>
```
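As one possible reading of "alter state/action sequences", the sketch below inserts a parameter-free action such as `wait` into an episode's action history. The action dictionary layout and the `insert_action` helper are assumptions for illustration; the actual script may perturb sequences in other ways.

```python
from typing import Dict, List

Action = Dict[str, str]


def insert_action(history: List[Action], index: int, action_type: str) -> List[Action]:
    """Return a copy of `history` with a parameter-free action inserted at `index`."""
    if not 0 <= index <= len(history):
        raise IndexError(f"index {index} is outside 0..{len(history)}")
    probe = {"type": action_type}
    return history[:index] + [probe] + history[index:]


# Hypothetical three-step episode.
history = [
    {"type": "click", "target": "Search"},
    {"type": "type", "text": "weather"},
    {"type": "click", "target": "Go"},
]
probed = insert_action(history, 1, "wait")
print([step["type"] for step in probed])
```

The original history is left unmodified, so the same episode can be reused for several structural variants.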

